This data set simulates customer behavior on the Starbucks rewards mobile app. Every few days, Starbucks sends an offer to users of the app. An offer may be merely an advertisement for a drink, or an actual offer such as a discount or a BOGO (buy one get one free). Some users might not receive any offer for weeks at a time.
Not every user receives the same offers, and that is the challenge of this data set.
The task is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app data, because the underlying simulator has only one product, whereas Starbucks actually sells dozens of products.
Every offer has a validity period. For example, a BOGO offer might be valid for only 5 days. You will notice in the data set that even informational offers have a validity period, although these are merely advertisements for a drink; for example, if an informational offer is valid for 7 days, you can assume the customer may be influenced by it during those 7 days.
The data set also contains transactional records of purchases made in the app. A transaction records the purchase time and the amount spent. The records also show which offers each customer received, how many, and when the customer viewed them. A record is likewise created whenever a customer completes an offer.
Also keep in mind that a customer may make purchases without having received or viewed any offer.
As an example, a customer receives a "spend 10 dollars, get 2 dollars off" offer on Monday, valid for 10 days from receipt. If the customer's cumulative spending within the validity period reaches 10 dollars, the customer has completed the offer.
However, there are subtleties in this data set. Offers take effect automatically: even if a customer never views an offer, the reward is still granted once the conditions are met. For example, a customer receives a "spend 10 dollars, get 2 dollars off" offer but never opens it during the 10-day validity period, yet spends 15 dollars within those 10 days. The data set will record the offer as completed, even though this customer was not actually influenced by the offer, because they never knew it existed.
Cleaning the data is therefore both important and tricky.
Also consider that some demographic groups will make purchases even without receiving an offer. From a business perspective, if a customer is going to spend 10 dollars regardless of whether they receive an offer, we would not want to send them a "spend 10 dollars, get 2 dollars off" offer. So it may be necessary to analyze what certain groups buy in the absence of any offer.
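One way to operationalize the "influenced" distinction described above is a small attribution rule: a completion only counts as influenced if the offer was viewed before completion, and everything happened inside the validity window. This is a sketch of an assumption, not a label present in the data; the function name and hour-based time units are made up for illustration.

```python
from typing import Optional

def offer_influenced(received_t: float,
                     viewed_t: Optional[float],
                     completed_t: Optional[float],
                     duration_hours: float) -> bool:
    """Attribution rule sketched from the description above (an assumption):
    a completion only counts as influenced if the offer was viewed before
    completion, all inside the validity window."""
    if viewed_t is None or completed_t is None:
        return False
    expiry = received_t + duration_hours
    return received_t <= viewed_t <= completed_t <= expiry

# viewed on day 2, completed on day 6 of a 10-day offer -> influenced
print(offer_influenced(0, 48, 144, 240))
# completed within the window but never viewed -> not influenced
print(offer_influenced(0, None, 144, 240))
```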
There are three data files:
The type and explanation of each variable in the files:
portfolio.json
profile.json
transcript.json
import pandas as pd
import numpy as np
import math
import json
import plotly.express as px
import scipy.stats as stats
from datetime import datetime
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
Use the data dictionary and the cells below to get a descriptive overview of the data.
# offers sent during the 30-day test period (10 offers x 6 fields)
portfolio
# check the shapes of the 3 dataframes
portfolio.shape, profile.shape, transcript.shape
# check null value
portfolio.isnull().sum(), profile.isnull().sum(), transcript.isnull().sum()
# transaction data - event log (306648 events x 4 fields)
transcript.head()
# demographic data - dataframe of rewards program users (17000 users x 5 fields)
print(profile.head())
print(profile[profile['gender'].isna()].shape) # rows where gender is missing
print(profile[profile['age'] == 118].shape) # age == 118 encodes a missing age
print(profile[(profile['age'] == 118) & (profile['gender'].isna())].shape) # rows where both gender and age are missing
1. Clean the customer dataframe
profile = pd.read_json('data/profile.json', orient='records', lines=True)
# gender and age are always missing together, so drop those records
df_customers = profile.dropna()
print(df_customers.shape)
assert profile.shape[0] - df_customers.shape[0] == 2175
# compute membership duration
# add a new column 'membership_since' holding the membership duration in days,
# measured up to 2019-12-31
til = '20191231'
d2 = datetime.strptime(til, '%Y%m%d')
# lambda to parse 'became_member_on' and count the days
dur = lambda v: abs((d2 - datetime.strptime(str(v), '%Y%m%d')).days)
df_customers['membership_since'] = df_customers['became_member_on'].apply(dur)
df_customers.head()
2. Process the transaction data - the 'value' column of the transcript dataframe
# merge transcript with clean customer dataframe
df_trans = pd.merge(transcript, df_customers, left_on='person', right_on='id')
del df_trans['id']
# rename 'offer id' to 'offer_id' in value column
def proc_col_value(text):
    '''
    INPUT
    text - the dict stored in the 'value' column
    OUTPUT
    f - the same dict with any 'offer id' key renamed to 'offer_id';
        all other keys are kept unchanged
    '''
    # note: the original version returned the raw dict as soon as it hit a
    # non-'offer id' key, which could skip the renaming entirely
    f = {}
    for k, v in text.items():
        if k == 'offer id':
            f['offer_id'] = v
        else:
            f[k] = v
    return f
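For reference, the 'value' dicts come in a few shapes; the examples below are illustrative (made-up values, shapes inferred from the comment about the offer_id, amount, and reward columns), together with a dict-comprehension equivalent of the renaming:

```python
# Illustrative 'value' payloads (made-up values): offer received/viewed carry
# 'offer id', transactions carry 'amount', offer completed carries
# 'offer_id' plus 'reward'
samples = [
    {'offer id': 'abc123'},
    {'amount': 12.5},
    {'offer_id': 'abc123', 'reward': 2},
]

def normalize_value(d):
    # rename the inconsistent 'offer id' key, keep every other key as-is
    return {('offer_id' if k == 'offer id' else k): v for k, v in d.items()}

print([normalize_value(d) for d in samples])
```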
# apply func to each cell and extend to 3 columns (offer_id, amount, reward )
df_tnx = pd.concat([df_trans, df_trans['value'].apply(proc_col_value).apply(pd.Series)], axis=1)
print(df_tnx.shape)
df_tnx.head()
3. Merge offer information into the new transaction dataframe
# merge the new transaction dataframe with portfolio based on offer id
df_x = pd.merge(df_tnx, portfolio, left_on='offer_id', right_on='id', how='left')
del df_x['id']
print(df_x.shape)
df_x.head()
# final transaction dataframe - df_x
# confirm the number of customers before and after joining the customers' profile dataframe
print(len(transcript['person'].unique()))
print(len(set(transcript['person'].unique()) & set(df_x['person'])))
4. Aggregate information into the customers dataframe
# aggregate amount and reward information per customer
s1 = df_x.groupby('person')['amount'].sum()
s2 = df_x.groupby('person')['reward_x'].sum()
df_ = pd.merge(s1, s2, on='person')
df_cu_tmp = pd.merge(df_customers, df_, left_on='id', right_on='person')
print(df_cu_tmp.shape)
df_cu_tmp.head()
def proc_offers_and_extends(col, df=df_x):
    '''
    INPUT
    col - the column name to process
    df - the merged transaction dataframe
    OUTPUT
    df_tmp - the new aggregate information per customer
    Description:
        count the values of the given column of the merged transaction
        dataframe for each customer.
    '''
    customers = list(df['person'].unique())
    tmp = {}
    for customer in customers:
        # count this customer's values in the merged transaction dataframe
        tmp[customer] = df[df.person == customer][col].value_counts().to_dict()
    # transpose the dataframe to one row per customer
    df_tmp = pd.DataFrame(tmp).T
    # reset the index into a column
    df_tmp.reset_index(level=0, inplace=True)
    return df_tmp
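The per-customer loop above can also be expressed with `pandas.crosstab`, which builds the same person-by-value count table in a single vectorized call; a sketch on made-up data:

```python
import pandas as pd

# toy transcript with the same person/event shape (made-up values)
toy = pd.DataFrame({
    'person': ['a', 'a', 'b', 'b', 'b'],
    'event':  ['offer received', 'offer viewed',
               'offer received', 'offer received', 'transaction'],
})

# one row per person, one column per event value, cells are counts
counts = pd.crosstab(toy['person'], toy['event'])
print(counts)
```

Unlike the loop, missing combinations come back as 0 rather than NaN.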
df_customers.head()
# find each customer's responses to the offers in terms of 'offer completed', 'offer received', 'offer viewed'
df_events = proc_offers_and_extends('event')
print(df_events.shape)
df_events.head()
# find each customer's received offers - 'bogo', 'discount', 'informational'
df_offers = proc_offers_and_extends('offer_type')
print(df_offers.shape)
df_offers.head()
# merge the aggregate information
df_ = pd.merge(df_events, df_offers, on='index')
df_cu = pd.merge(df_cu_tmp, df_, left_on='id', right_on='index')
del df_cu['index']
df_cu.head()
df_cu.shape
df_cu.isnull().sum()
Find outliers
df_cu.amount.describe()
def remove_outliers(col, df):
    '''
    INPUT
    col - the column to screen for outliers
    df - the source dataframe
    OUTPUT
    df_ - the dataframe with the outliers removed
    Description:
        uses Tukey's rule (1.5 x IQR fences)
    '''
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    IQR_amount = q3 - q1
    max_value = q3 + 1.5*IQR_amount
    min_value = q1 - 1.5*IQR_amount
    print(min_value, max_value)
    df_ = df[(df[col] <= max_value) & (df[col] >= min_value)]
    return df_
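A quick toy illustration of the Tukey fences computed by `remove_outliers` (numbers made up; the fences are re-derived inline so the cell stands alone):

```python
import pandas as pd

# toy amounts with one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 200])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Tukey's rule: keep values inside [q1 - 1.5*IQR, q3 + 1.5*IQR]
kept = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]
print(sorted(kept))
```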
df_cu.head()
df_cu.shape
1. What are the customer features, and how is each distributed?
df_cu['age'].describe()
df_cu.query("gender == 'M'")['age'].describe()
df_cu.query("gender == 'F'")['age'].describe()
The customers' mean age is 54, median 55, minimum 18, maximum 101, with a standard deviation of 17.38.
Male customers: mean age 52, median 53, minimum 18, maximum 100, standard deviation 17.41.
Female customers: mean age 57, median 58, minimum 18, maximum 101, standard deviation 16.88.
# check the distribution of age by gender
fig = px.histogram(df_cu, x="age", color="gender", marginal="rug", # can be `box`, `violin`
hover_data=df_cu.columns, title="Customer age distribution")
fig.show()

# the distribution of customers' gender
gender_counts = df_cu['gender'].value_counts()
fig = px.pie(values=gender_counts.values, names=gender_counts.index,
title="Customer gender distribution")
fig.show()

Of all customers, 57.2% are male and 41.3% are female.
1.1 The relationship between customer age and amount spent
# the relationship between customer age and amount
fig = px.scatter(df_cu, x="age", y="amount", color="gender",
marginal_y="rug", marginal_x="box")
fig.show()

# remove outliers
df = remove_outliers('amount', df_cu)
fig = px.scatter(df, x="age", y="amount", color="gender", marginal_y="rug", marginal_x="histogram")
fig.show()

fig = px.scatter(df, x="age", y="amount", trendline="ols")
fig.show()

results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()
The analysis above shows some positive relationship between customer age and amount spent (R squared: 0.020, P value < 0.05).
1.2 The relationship between customer income and amount spent
# the relationship between customer income and amount, using the data with outliers removed
fig = px.scatter(df, x="income", y="amount", color="gender",
marginal_y="rug", marginal_x="histogram")
fig.show()

fig = px.scatter(df, x="income", y="amount", trendline="ols")
fig.show()

results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()
The analysis above shows some positive relationship between customer income and amount spent (R squared: 0.099, P value < 0.05).
2. How can customers' purchase behavior be characterized? How do customers respond to the different offers?
# df is the dataframe with 'amount' outliers removed
fig = px.scatter(df, x="age", y="offer completed", color="gender", marginal_y="box", marginal_x="histogram")
fig.show()

fig = px.scatter(df, x="age", y="offer viewed", color="gender", marginal_y="box", marginal_x="histogram")
fig.show()

fig = px.scatter(df, x="age", y="offer received", color="gender", marginal_y="box", marginal_x="histogram")
fig.show()

df.head()
# create a new dataframe to analyze male customers' purchase behaviors by age
df_male = df.query("gender == 'M'").groupby("age")[['amount','offer completed','transaction',
'bogo','discount','informational']].mean()
df_male.reset_index(level=0, inplace=True)
# the statistics of 'offer completed' before removing outliers
df_male['offer completed'].describe()
# remove outliers based on 'offer completed'
df_male_ = remove_outliers('offer completed', df_male)
df_male_['offer completed'].describe()
# show the figure of the relationship between age and offer_completed
fig = px.scatter(df_male_, x='age',
y="offer completed",
trendline="ols",
hover_data=df_male_.columns, title="Average completed offers of male customers")
fig.show()

res_offer_cplt = px.get_trendline_results(fig)
res_offer_cplt.px_fit_results.iloc[0].summary()
df_male_ = remove_outliers('bogo', df_male)
fig = px.scatter(df_male_, x='age',
y="bogo",
trendline="lowess",
hover_data=df_male_.columns, title='Average BOGO offers of male customers')
fig.show()

Male customers' positive responses to offers tend to increase with age. Male customers under 57 show a positive relationship with BOGO offers; above 57 the trend is not obvious.
fig = px.scatter(df_male, x='age',
y="amount",
trendline="ols",
hover_data=df_male.columns, title="Average amount spent by male customers")
fig.show()

res_amount = px.get_trendline_results(fig)
res_amount.px_fit_results.iloc[0].summary()
df_male['transaction'].describe()
fig = px.scatter(df_male, x='age',
y="transaction",
trendline="lowess",
hover_data=df_male.columns, title="Average number of transactions of male customers")
fig.show()

For male customers, the average number of transactions tends to decrease with age but levels off after 60, while the average amount spent tends to increase with age.
df_male_ = remove_outliers('discount', df_male)
fig = px.scatter(df_male_, x='age',
y="discount",
trendline="ols",
hover_data=df_male_.columns, title="Average discount offers of male customers")
fig.show()

res_dsc = px.get_trendline_results(fig)
res_dsc.px_fit_results.iloc[0].summary()
For male customers, responses to discount offers tend to increase with age.
df_male_ = remove_outliers('informational', df_male)
fig = px.scatter(df_male_, x='age',
y="informational",
trendline="lowess",
hover_data=df_male_.columns, title="Average informational offers of male customers")
fig.show()

df_female = df.query("gender == 'F'").groupby("age")[['amount','offer completed','transaction',
'bogo','discount','informational']].mean()
df_female.reset_index(level=0, inplace=True)
df_female['offer completed'].describe()
fig = px.scatter(df_female, x='age',
y="offer completed",
trendline="ols",
hover_data=df_female.columns, title="Average completed offers of female customers")
fig.show()

res_oc = px.get_trendline_results(fig)
res_oc.px_fit_results.iloc[0].summary()
df_female_ = remove_outliers('bogo', df_female)
fig = px.scatter(df_female_, x='age',
y="bogo",
trendline="lowess",
hover_data=df_female_.columns, title='Average BOGO offers of female customers')
fig.show()

Female customers' positive responses to offers are fairly flat across age, with no obvious upward or downward trend (p value = 0.51 > 0.05, so the null hypothesis cannot be rejected).
df_female_ = remove_outliers('transaction', df_female)
fig = px.scatter(df_female_, x='age',
y="transaction",
trendline="lowess",
hover_data=df_female_.columns, title="Average number of transactions of female customers")
fig.show()

fig = px.scatter(df_female_, x='age',
y="amount",
trendline="lowess",
hover_data=df_female_.columns, title="Average amount spent by female customers")
fig.show()

Similar to male customers, female customers' average number of transactions tends to decrease with age but levels off after 60. The average amount spent increases with age, but trends downward after 70.
df_female_ = remove_outliers('discount', df_female)
fig = px.scatter(df_female_, x='age',
y="discount",
trendline="lowess",
hover_data=df_female_.columns, title="Average discount offers of female customers")
fig.show()

df_female_ = remove_outliers('informational', df_female)
fig = px.scatter(df_female_, x='age',
y="informational",
trendline="lowess",
hover_data=df_female_.columns, title="Average informational offers of female customers")
fig.show()

For female customers under 60, responses to discount offers tend to increase with age, but decline after 60. For female customers under 48, responses to informational offers tend to increase with age; above 48 the trend is not obvious.
# Group 1 -> income less than 50k, all customers
df_income1 = df.query("income < 50000 ")[['amount',
'offer completed','transaction',
'bogo','discount','informational']]
df_income1.shape
# Group 1_1 -> income less than 50k and male customer
df_income1_1 = df.query("(income < 50000) and (gender == 'M') ")[['amount',
'offer completed','transaction',
'bogo','discount','informational']]
df_income1_1.shape
# Group 1_Female -> income less than 50k and female customer
df_income1_2 = df.query("(income < 50000) and (gender == 'F') ")[['amount',
'offer completed','transaction',
'bogo','discount','informational']]
df_income1_2.shape
# comparison test between male and female customers with income below 50k (t_test is defined in a later cell)
t_test1 = t_test(df_income1_1, df_income1_2)
t_test1
The comparison above shows that, between male and female customers with income below 50k, amount spent, completed offers, and number of transactions all differ significantly, so the null hypothesis can be rejected.
def t_test(df1, df2):
    '''
    INPUT
    df1 - dataframe, sample size by feature size
    df2 - dataframe, sample size by feature size
    OUTPUT
    t_score - list of t-scores, one per feature
    p_value - list of two-tailed p-values, one per feature
    Description:
        Welch's two-sample t-test per feature. The two dataframes must have
        the same number of columns, in the same comparable order.
    '''
    # check comparability - the number of columns must match
    assert df1.shape[1] == df2.shape[1]
    # sample sizes - the number of rows
    n1 = df1.shape[0]
    n2 = df2.shape[0]
    t_score = []
    p_value = []
    for idx in range(df1.shape[1]):
        # squared standard errors from the sample variances (ddof=1)
        se1 = df1.iloc[:, idx].var() / n1
        se2 = df2.iloc[:, idx].var() / n2
        t = (df2.iloc[:, idx].mean() - df1.iloc[:, idx].mean()) / np.sqrt(se1 + se2)
        t_score.append(t)
        # Welch-Satterthwaite degrees of freedom, matching the
        # unequal-variance denominator above
        df = (se1 + se2)**2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
        # two-tailed p-value (a one-tailed 1 - cdf(t) would understate
        # significance whenever t is negative)
        p = 2 * (1 - stats.t.cdf(abs(t), df=df))
        p_value.append(p)
    return t_score, p_value
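The hand-rolled t statistic can be cross-checked against `scipy.stats.ttest_ind` with `equal_var=False`, which uses the same unequal-variance denominator; a minimal check on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(0.3, 1.2, 180)

# hand-rolled Welch-style t statistic, mirroring the per-feature formula
t_manual = (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                           + b.var(ddof=1) / len(b))
t_scipy, p_scipy = stats.ttest_ind(b, a, equal_var=False)
print(t_manual, t_scipy)
```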
# Group 2 -> income between 50k and 75k, all customers
df_income2 = df.query("income >= 50000 and income <= 75000")[['amount','offer completed','transaction',
'bogo','discount','informational']]
df_income2.shape
# Group 2_1 -> income between 50k and 75k, male customers
df_income2_1 = df.query("income >= 50000 and income <= 75000 and gender =='M'")[['amount',
'offer completed','transaction',
'bogo','discount','informational']]
df_income2_1.shape
# Group 2_2 -> income between 50k and 75k, female customers
df_income2_2 = df.query("income >= 50000 and income <= 75000 and gender =='F'")[['amount',
'offer completed','transaction',
'bogo','discount','informational']]
df_income2_2.shape
# comparison test between male and female customers whose incomes are in the range of 50k to 75k
t_test2 = t_test(df_income2_1, df_income2_2)
t_test2
The comparison above shows that, between male and female customers with income between 50k and 75k, amount spent and completed offers differ significantly (reject the null hypothesis), but the number of transactions does not (cannot reject the null hypothesis).
# Group 3 -> income greater than 75k, all customers
df_income3 = df.query("income > 75000 ")[['amount','offer completed','transaction',
'bogo','discount','informational']]
df_income3.shape
# Group 3_1 -> income greater than 75k, male customers
df_income3_1 = df.query("income > 75000 and gender == 'M'")[['amount','offer completed','transaction',
'bogo','discount','informational']]
df_income3_1.shape
# Group 3_2 -> income greater than 75k, female customers
df_income3_2 = df.query("income > 75000 and gender == 'F'")[['amount','offer completed','transaction',
'bogo','discount','informational']]
df_income3_2.shape
# comparison test between the male and female customers whose income are greater than 75k
t_test3 = t_test(df_income3_1, df_income3_2)
t_test3
The comparison above shows that, between male and female customers with income above 75k, none of amount spent, completed offers, or number of transactions differs significantly; the null hypothesis cannot be rejected.
# comparison test between customers with income below 50k and customers
# with income in the range of 50k to 75k
t_test4 = t_test(df_income1, df_income2)
t_test4
The comparison above shows that, between customers with income below 50k and those with income between 50k and 75k, amount spent and completed offers differ significantly (reject the null hypothesis), while the number of transactions does not (cannot reject the null hypothesis).
# comparison test between customers with income below 50k and customers
# with income greater than 75k
t_test5 = t_test(df_income1, df_income3)
t_test5
The comparison above shows that, between customers with income below 50k and those with income above 75k, amount spent and completed offers differ significantly (reject the null hypothesis), while the number of transactions does not (cannot reject the null hypothesis).
# comparison test between customers with income greater than 75k and customers
# with income in the range of 50k to 75k
t_test6 = t_test(df_income3, df_income2)
t_test6
The comparison above shows that, between customers with income between 50k and 75k and those with income above 75k, amount spent and completed offers do not differ significantly (cannot reject the null hypothesis), but the number of transactions does (reject the null hypothesis).
# get and check train data
df_train = df_cu[['age', 'gender', 'income', 'membership_since', 'amount',
'offer completed', 'offer received', 'offer viewed']]
df_train.head(), df_train.shape, df_train.isnull().sum()
# fill NaN with 0 (assign back rather than inplace, since df_train is a slice of df_cu)
df_train = df_train.fillna(0)
# preprocess the gender field
df_train = pd.concat([df_train, pd.get_dummies(df_train['gender'])], axis=1)
del df_train['gender']
# remove outliers
df_train = remove_outliers('amount', df_train)
df_train.shape
df_train.head()
# preprocess the train data by standardization
X = df_train[['age', 'income', 'membership_since', 'F', 'M', 'O',
'offer completed', 'offer received', 'offer viewed']].values
Y = df_train['amount'].values
# standardize the first 3 columns
scaled_X = {}
for each in range(3):
mean, std = X[:,each].mean(), X[:,each].std()
scaled_X[each] = [mean, std]
X[:, each] = (X[:, each] - mean)/std
y = (Y - Y.mean())/Y.std()
# import libs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.metrics import r2_score
# train test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# build random forest regressor model pipeline
pipeline = Pipeline([
#('scalar', StandardScaler()),
('regr', RandomForestRegressor())
])
parameters = {
'regr__n_estimators': (10, 50, 100, 1000),
'regr__max_depth': range(3,7),
}
model = GridSearchCV(pipeline, param_grid=parameters)
model.fit(X_train, y_train)
print(model.best_estimator_)
y_pred = model.predict(X_test)
# find r square score
print(r2_score(y_test, y_pred))
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
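For interpretation, the fitted forest also exposes `feature_importances_` (reachable here via `model.best_estimator_.named_steps['regr']`); a self-contained sketch on synthetic data where the first column carries almost all the signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
# the target depends almost entirely on column 0
y_demo = 3 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
# importances sum to 1; argsort descending ranks the features
order = np.argsort(rf.feature_importances_)[::-1]
print(order)
```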
# build an alternative neural network model using Keras
from keras.models import Sequential
from keras.layers import Dense
# find input size
input_size = X.shape[1]
# define base model
model = Sequential()
model.add(Dense(64, input_dim=input_size, kernel_initializer='normal', activation='relu'))
model.add(Dense(16, kernel_initializer='normal', activation='relu'))
model.add(Dense(16, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
# train the model
kfold = KFold(n_splits=10)
batch_size = 64
epochs = 80
cnt = 0
# note: the same model instance keeps training across folds, so this is
# incremental training over the folds rather than independent cross-validation
for train_index, test_index in kfold.split(X_train):
    cnt += 1
    print("Fold {}".format(cnt))
    X_tr, X_te = X_train[train_index], X_train[test_index]
    y_tr, y_te = y_train[train_index], y_train[test_index]
    r = model.fit(X_tr, y_tr,
                  batch_size=batch_size,
                  epochs=epochs,
                  shuffle=True,
                  verbose=1,
                  validation_data=(X_te, y_te))
model.summary()
y_pred=model.predict(X_test)
print(r2_score(y_test, y_pred))
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
1. The following customer characteristics were found.
For male customers, the average number of completed offers tends to increase with age. Male customers under 57 show a positive relationship with BOGO offers; above 57 the trend is not obvious.
For male customers under 60, the average number of transactions tends to decrease with age, but above 60 the trend is not obvious; the average amount spent tends to increase with age.
For male customers, responses to discount offers tend to increase with age.
Female customers' positive responses to offers are fairly flat across age, with no obvious upward or downward trend.
Similar to male customers, female customers' average number of transactions tends to decrease with age but levels off after 60. The average amount spent increases with age, but trends downward after 70.
For female customers under 60, responses to discount offers tend to increase with age, but decline above 60. For female customers under 48, responses to informational offers tend to increase with age; above 48 the trend is not obvious.
Between male and female customers with income below 50k, amount spent, completed offers, and number of transactions all differ significantly; the null hypothesis can be rejected.
Between male and female customers with income between 50k and 75k, amount spent and completed offers differ significantly (reject the null hypothesis); the number of transactions does not (cannot reject the null hypothesis).
Between male and female customers with income above 75k, none of amount spent, completed offers, or number of transactions differs significantly; the null hypothesis cannot be rejected.
Between customers with income below 50k and those with income between 50k and 75k, amount spent and completed offers differ significantly (reject the null hypothesis); the number of transactions does not (cannot reject the null hypothesis).
Between customers with income below 50k and those with income above 75k, amount spent and completed offers differ significantly (reject the null hypothesis); the number of transactions does not (cannot reject the null hypothesis).
Between customers with income between 50k and 75k and those with income above 75k, amount spent and completed offers do not differ significantly (cannot reject the null hypothesis), but the number of transactions does (reject the null hypothesis).
2. Machine learning modeling.
Using a customer's age, income, gender, membership duration, and aggregated transaction information as features to predict the amount spent.
The baseline model is a Random Forest Regressor with R^2 = 0.61; the alternative NN model reaches R^2 = 0.58.